The Bayesian Bridge 

Nicholas G. Poison 

University of Chicago* 

James G. Scott 
University of Texas at Austin 

First Version: July 2011 
This Version: August 2011 



Abstract 



We develop the Bayesian bridge estimator for regularized regression and clas- 
sification. We focus on two key mixture representations for the prior distribu- 
tion that give rise to the Bayesian bridge model: (1) a scale mixture of normals 
with respect to an alpha-stable random variable; and (2) a mixture of Bartlett- 
Fejer kernels (or triangle densities) with respect to a two-component mixture 
of gamma random variables. The first representation is a well known result 
due to West (1987). We show how it can be exploited to yield an efficient, 
parallelizable EM algorithm for calculating the joint posterior mode — even un- 
der certain non-Gaussian likelihoods, such as those that arise in logistic or 
quantile regression. The second representation is new. Its main virtue is that 
avoids the need to deal with exponentially tilted stable random variables, and 
therefore leads to a much simpler MCMC scheme than the scale mixture repre- 
sentation. It also provides insight into the multimodality of the joint posterior 
distribution, which is a notable feature of the Bayesian bridge model that is ab- 



sent under more-traditional ridge or lasso- type priors (Park and Casella, 2008 



Hans, 2009). Finally, we show how our approach can be extended to a wider 



class of non-convex regularization penalties based on Polya-type characteristic 
functions, including the double-Pareto model of Armagan et al. (2010). We 



compare the performance of the Bayesian bridge model to its classical cousin 
across a variety of data sets, both simulated and real. 



*Polson is Professor of Econonietrics and Statistics at the Chicago Booth School of Business, 
email: ngp@chicagobooth.edu. Scott is Assistant Professor of Statistics at the University of Texas 
at Austin, email: James.Scott@mccombs.utexas.edu. 
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1 Introduction 



1.1 Penalized likelihood and the Bayesian bridge 



This paper studies the Bayesian analogue of the classical bridge estimator for regres- 
sion and classification problems. The bridge is most easily understood in the multiple 
regression model, where 



y = X/3 + e, e~N(0,a2j), 

for y = . . . ,?/„)' and some unknown vector of coefficients f3 
compute the bridge estimator, solve for f3 as 



1; 



To 



^(„,,) = argminQ(/3), where Q {f3) = -\\y - Xf3\\' + uJ^lPjl" 



(2) 



for specific choices of a G (0, 1) and regularization parameter u G M^. The name 
comes from the fact that the objective function in ^ bridges an entire class of 
concave penalties, with the combinatorially hard (or best-subset-selection) penalty 
at one end, and the convex (or lasso) penalty at the other. 

As many previous authors have pointed out, the bridge naturally yield sparse 
estimates for /3, in the sense that some components of (3(^a,u) be explicitly zeroed 
out. The solution to ^ thus simultaneously accomplishes the goals of estimation and 
model selection. Most of the recent work on this estimator concerns either model- 
selection asymptotics or computation — this latter issue being especially important, in 
light of the well-documented challenges posed by non-convex optimization problems 



(1993 


), 


Huang et al. 


(2008 


), and 


Mazumder et al. 


(2009 



Our paper differs from this line of work in adopting a Bayesian approach to bridge 
estimation. Specifically, we treat p{f3 \ y) oc e~^^^^ as a posterior probability distribu- 
tion which has the solution to ^ as its global mode. This posterior arises in assuming 
a Gaussian likelihood for y, as in ([T|, along with a prior for f3 that decomposes as a 
product of exponential-power priors ( [Box and Tiao 1973): 



p{f3\a,u)cxl[exp{-\P,/Tn 



T 



-1/a 



(3) 



In pursuing this approach, we follow along the same lines as well-known Bayesian 



treatments of other classical regularization penalties, such as the ridge (Lindley and 



Smith 1972), lasso (Park and Casella 2008 Hans 2009 2010), and elastic net (Li and 



Lin, 2010; Hans, 2011). Yet there are several unique features of the Bayesian-bridge 



model that distinguish our paper from these other parallel Bayesian treatments of 
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classical penalty functions. We are especially interested in two of these features, one 
statistical and the other practical. 

First, from a statistical standpoint, the most interesting feature of the bridge 
penalty is its concavity. This is both a blessing and a curse. When the underlying 
signal is sparse, and when some further regularity conditions are met, the bridge 
penalty — precisely by virtue of its concavity — is known to dominate the lasso and 
ridge penalties when judged by a classical criterion known as the oracle property 
(Fan and Li, 2001; Huang et al. , 2008). Although the oracle property per se is of 



no particular relevance to a Bayesian treatment of the problem, it does correspond 
to a feature of certain prior distributions that Bayesians have long found important: 
namely, the property of yielding a redescending score function for the predictive 
distribution of y (Pericchi and Smith, 1992). This property is a blessing, in the 



sense that it avoids overshrinkage of large regression coefficients even in the presence 
of extreme sparsity (Poison and Scott, 2011a). Yet it is also a curse, in that it 
produces multimodality — and all its attendant computational difficulties — in the joint 
posterior distribution for f3. Indeed, we will argue that the multimodality of the 
objective function in ^ is one of the strongest arguments for pursuing a fully Bayes 
approach to bridge regression, for the simple reason that it can be misleading to 
summarize a highly multimodal surface in terms of a single point estimate, no matter 
how appealingly sparse that estimate may be. 

Second, from a practical standpoint, sampling from the full posterior under the 
Bayesian bridge model is very challenging — much more so than in the ridge, lasso, 
or elastic net models. MCMC sampling in these models, and in many similar ones 



e.g. Griffin and Brown, 2010) relies upon representing the implied prior distribution 



for (3j as a scale mixture of normals. The exponential-power prior in ^ is known to 
lie within this class of models (West, 1987). Yet the mixing distribution that arises 



in the conditional posterior is that of an exponentially tilted alpha-stable random 
variable. This is notoriously difficult to handle, owing to the lack of a closed-form 
expression for the density function. The same MCMC approach that works for the 
lasso is therefore quite difficult to apply in the case of the bridge. This fact was 



recognized by Armagan (2009), who proposed using variational methods to perform 



approximate Bayesian inference while sidestepping the challenges associated with 
stable distributions. 



1.2 Goals of the paper 

In dealing with these challenges, this paper makes three specific contributions to the 
literature on Bayesian regularized regression. 
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Figure 1: Left: triangular densities, or normalized Bartlett-Fejer kernels, of dif- 
ferent widths. Right: two examples of mixing distributions for cjj that give rise to 
exponential-power marginals for (3j in conjunction with the Bartlett-Fejer kernel. 



1. We prove the following novel mixture representation for the Bayesian bridge: 



{uj I a) 



N(X/3,(T^/) 

' -1 



l/a 



TOO 



l/a 



1 + a 1 — ex 

-—- ■ Ga(2 + l/a, 1) + -—- ■ Ga(l + l/a, 1) 



(4) 
(5) 



This representation, which recovers e~^^^^ as the marginal posterior in /3, is 
depicted in Figure [T] and explained in detail in Section |3| It is based not upon 
the theory of normal scale mixtures, but upon mixtures of Bartlett-Fejer kernels 
(which are special cases of triangle densities). It leads to a simple MCMC that 
avoids the need to deal with alpha-stable distributions, and — since it involves 
directly working with a bimodal mixing distribution — is capable of easily and 
naturally hopping between modes in the joint posterior. This solves one of the 
major outstanding difficulties of working with the bridge objective function. 
We also extend this representation to encompass a wider class of non-convex 
regularization penalties based on Polya-type characteristic functions, including 



recent proposals based on Pareto-type models (Armagan et al. , 2010). 



2. We use the Levy representation of Poison and Scott (2011b) to derive an EM al- 



gorithm based on Krylov-subspace methods for calculating the posterior mode 
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under the Bayesian bridge model. This algorithm employs a block updating 
scheme — a sharp and useful departure from traditional coordinate-descent al- 
gorithms, which are are known to encounter difficulties in highly multi-modal 
situations. Moreover, the algorithm is easily parallelized for use in ultra-high- 
dimensional data sets (see Algorithm 4). 

3. We extend this EM algorithm to provide estimates of the posterior mode under 
certain non-Gaussian likelihoods, including bridge logistic regression and bridge 
quantile regression. 



1.3 Relationship with previous work 



Within the broader class of regularized estimators in high- dimensional regression, 
there has been widespread interest in cases where the penalty function corresponds 
to a normal scale mixture. We have already mentioned the Bayesian ridge, lasso, and 



elastic net models, described in papers by Lindley and Smith (1972), Park and Casella 



(|2008j), |Li and Lin| ( |2010D , and |Hans| ( |2009| |2010[ 2011 ). Several others in this class 
include the relevance vector machine of Tipping (|2001 ); the normal/ Jeffreys model 



of Figueiredo (2003) and Bae and Mallick ( 2004[ ); the normal/exponential-gamma 
model of Griffin and Brown (2005); the normal/gamma and normal/inverse-Gaussian 



(Caron and Doucet 


2008 


Griffin and Brown 2010 


); the horseshoe prior of ( 


I!arvalho 


et al. 


( 


2010 


); the hypergeometric inverted-beta model of 


Poison and Scot1 


^ ( 


2010 


); 



and the double-Pareto model of Armagan et al. (2010). 



Since the bridge penalty is also a normal scale-mixture prior, our work naturally 
fits within this larger body of literature on Bayesian regularized regression. Yet our 
paper relies far less upon the theory of normal scale mixtures, as in the treatment of 
Armagan (2009), and instead uses the Bartlett-kernel mixture of Equation ^ and ^ 
in deriving an MCMC sampling scheme for the encompassing model. This difference 
will be elaborated further in Section |3l 

These authors repeatedly emphasize two points that will echo in the examples we 
present in Section |6} First, the solution to the penalized-likehood problem in ^ may 
produce a sparse estimator, but this estimate is provably suboptimal (in a Bayes-risk 
sense) with respect to many traditional loss functions. If, for example, one wishes 
either to estimate f3 or to predict future values of y under squared-error loss, then 
the optimal choice is to use the posterior mean, usually estimated via MCMC, in 
lieu of the mode. This conclusion is no mere tempest in a theoretical teapot. Indeed, 



both Park and Casella (2008) and Hans (2009) give realistic examples where the lasso 



posterior mean significantly outperforms the posterior mode, both in prediction and 



in estimation. Similar conclusions are reached by Efron (2009) in a parallel context. 
Our own examples provide evidence of the practical differences that arise on real data 
sets — not merely between the mean and the mode, but also between the classical mode 
and the mode of the joint distribution in the Bayesian model, marginal over over r 
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and a. (A collection of R scripts implementing all of our numerical experiments is 
available supplemental file.) 

Second, a fully Bayesian approach can often lead to very different conclusions than 
a traditional penalized-likelihood analysis, particularly regarding which components 
of (3 are important in predicting y. For example, Hans (2010) produces several 
examples where the classical lasso estimator aggressively zeroes out components of (3 
for which, according to a full Bayes analysis, there is quite a large amount of posterior 
uncertainty regarding their size. This is not to suggest that one conclusion is right, 
and the other wrong, in any specific setting — merely that the two conclusions can be 
quite different, and that practitioners are well served by having both at hand. 



2 The usual scale-mixture representation 



The traditional construction of the exponential-power prior involves the following 
mixture: 

/•oo 

kexpi-\tn= 0(t|O,A-i) A-i/Va/2(A), (6) 

^0 

where A; is a normalization constant; 0(2; | m,v) is the normal density function with 
mean m and variance v; and fb{t) is the density of a positive alpha-stable random 
variable with index of stability b. West (1987) proves the result when 1 < 



a 



< 2, 

while Poison and Scott (2011b) appeal to an argument involving moment-generating 
functions of Levy processes to extend the result for all < a < 2. This involves 
treating the unnormalized density directly as the moment-generating function of a 
subordinator (or strictly increasing Levy process), evaluated at t^/2: 



E(e 



(7) 



where the expectation is with respect to /a/2(A). This avoids the extra term of A~^/^ 
and greatly simplifies the analysis. 

Similar mixture representations have been exploited to yield simple MCMC al- 
gorithms for a wide variety of models, including many of those referenced in the 
introduction. Armagan (2009) also uses ^ to derive a variational-EM approach for 
approximate Bayesian analysis. 

Unlike these other models, however, the mixture representation for the bridge is 
less than hospitable to full Bayes analysis via MCMC. To see the difficulty, observe 



6 



that ([7]) leads to the following latent-variable representation of the joint posterior: 
A I y) = Cexp 

= C exp 

where A = diag(Ai, . . . , Aj). The conditional posterior of Xj given (3j is then an 
exponentially tilted stable random variable, leading to a Gibbs sampler for p{/3, A \ y) 
involving two steps. 

1. Generate p{f3 \ A,y). This is a multivariate normal draw with 

(/3 I A, z/, y) ~ A/" (^{a-^X'X + 2u^A)-^ ■ a'^X'y, a\a~^X'X + 2u^A)-^^ . 



u'/"f3'Af3 - ^^(3'X'X(3 + (3'a-'X'y] J] U/,{X,) 
{-\f3' {a-^X'X + 2z/2/°A) /3 + fi'a-^X'yX \[ U/,{X,) , (8) 



2. Generate p{A \ /3, z/, y) component-wise, with each component having an expo- 
nentially tilted stable conditional distribution: 



-i.5|/3,|2A, 



M I fl ^ " '/a/2(A 



The difficulty here is that neither the prior density fa/2{Xj), nor the posterior density 
just given, are known in closed form, and can be only be given explicitly in terms of 
integral or series representations. Any efficient sampling scheme must therefore avoid 
evaluating both the prior and posterior density functions for Xj. 



Godsill (2000) studied a similar problem and gives three options for sampling 
from this class of distributions. The first two use the alpha-stable prior fa/2{Xj) as a 
proposal distribution within a rejection sampler and a Metropolis-Hastings sampler, 
respectively. The third option involves an alternative accept/reject algorithm within 
an overall slice sampler. Unfortunately, as Godsill (2000) writes, all three methods 
have significant practical limitations: "The problem is that all three schemes rely on 
sampling from the prior distribution . . . which may not be well attuned to the poste- 
rior distribution itself. Hence rejection rates and convergence rates can be very poor 
in application to both real and simulated data." The paper goes on to recommend 
an alternative approach involving a power-series expansions of the tails of the stable 
distribution for use in practical applications. But this must also be tuned to different 



configurations of the prior vis-a-vis the region where the likelihood is largest. Devroye 



(1996) describes an alternative method for sampling from exponentially tilted stables. 
But we have encountered difficulties in implementing this method, since it requires 
evaluating the hyperbolic cosine function for sometimes-extreme arguments, leading 
to numerical instability. 
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Despite these difficulties, the Levy representation of the bridge penalty is still 
very useful: as we will show, it leads to fast, efficient EM algorithms for calculating 
posterior modes. These algorithms do not require evaluating the density of a sta- 
ble random variable; offer a parallelizable block updating scheme (unlike traditional 
coordinate-descent methods); and can be extended to many non-Gaussian likelihoods. 
This class of algorithms is described in detail in Section [5j 

3 Mixtures of Bartlett— Fejer kernels 

We now describe an alternative approach for sampling from the Bayesian-bridge pos- 
terior distribution. This approach involves a counter-intuitive step: abandoning the 
normal scale-mixture framework entirely, in favor of a representation that uses mix- 
tures of Bartlett-Fejer kernels. Compared with the above approach involving stable 
distributions, the Bartlett-kernel approach has three crucial advantages within the 
context of MCMC sampling: it is simple to program; it requires no additional tuning; 
and it generalizes to a much wider class of models that can be sampled using a similar 
scheme. It therefore represents a promising new approach for generating regulariza- 
tion penalties with tractable Bayesian analogues. The representation is also highly 
intuitive, in that it allows one to see precisely how two salient features of the bridge 
posterior — its non-differentiable point at zero, and its multimodality — arise from the 
prior. 

We now describe this approach in further detail. The following theorem describes 
a class of priors that can be represented as mixtures of Bartlett-Fejer kernels where 
the mixing distribution is explicitly known. It relies upon the useful fact that cer- 
tain characteristic functions of Polya type can be reinterpreted as symmetric density 
functions on the real line. 

Theorem 3.1. Let f{t) he a function that is symmetric about the origin; integrable, 
convex, and twice- differentiable on (0, oo); and for which /(O) = 1, andlimt^ao f{(3) = 
0. Let k = {2 Jq°° f{t) dt}~^ denote the normalizing constant that makes f{t) a density 
on the real line. Then f is the following mixture of Bartlett-Fejer kernels: 



where a+ = max(a, 0). 

As an immediate corollary, we have the following mixture representation for the 
exponential power density. 

Corollary 3.2. The exponential power density, a G (0,1], is mixture of Bartlett- 
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Fejer kernels, where 



a 



2rr(l/a) 



exp(-|/3/rn) 



p{u I a) 





1 + a 



l/a 



/3 






1. 









I a) du (10) 
caw^/^e"^ , (11) 



where Ci anc? C2 are i/ie normalizing constants of their respective gamma densities. 

The Bayesian bridge prior can therefore be written using the simple two- component 
mixture of gamma random variables from Equat ions ([4|) and ([sj) an d Figure [l] Note 
that this family includes the Bayesian lasso of Park and Casella (2008) and iHans 



(2009), which is seen to be a simple gamma mixture of triangle densities (since the 



second mixture component drops out when a = 1). 

4 MCMC sampling for the Bayesian bridge 
4.1 Sampling f3 and the latent variables 

To see why the representation in (|4])-([5]) leads to such a simple algorithm for posterior 
sampling, consider the joint distribution for f3 and the latent u/s: 



p{f3,Q \T,y)= Cexp l-—f3'X'Xf3+^f3'X'yj l[p{u, | a) J] 1 

^ ^ i=l i=l \ 



1/3. 



l/a 



:i2) 



Introduce further shce variables Ui, . . . ,Uj. This leads to the joint posterior 



p{f3,n,u\T,y) oc exp{-^f3'X'Xf3+^f3'X'y 



p 



p 



X JJp(wj I a) JJl I < < 1 



1/3. 



TLO 



l/a 



(13) 



where I(-) is the indicator function. Note that we have implicitly absorbed a factor 
of co^^" from the normalization constant for the Bartlett-Fejer kernel into the gamma 
conditional for Uj. This will make inverting the slice region for uj far easier. 

Applying Corollary 3.2, if we marginalize out both the slice variables and the 
latent uj/s, we recover the Bayesian bridge posterior distribution, 

pif3 I y) = Cexp (^-^||y - X/3f - |/3,/rr j . 
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We can invert the slice region in (13) by defining ( Clj , 6j ^ clS 



<r \l- Uj)uj/'' = bj and > 

This leads us to an exact Gibbs sampler that starts at initial guesses for (/3, Q) and 
iterates the following steps: 

1. Generate {uj \ f3j,ijJj) ~ Unif ^0, 1 — T\f3j\ujj ^^"j . 

2. Generate each Uj from a mixture of truncated gammas, as described below. 

3. Generate f3 from a truncated multivariate normal proportional to 

N (^,a\X'X)-^)l{\(3j\ < bj for all j) , 



where f3 indicates the least-squares estimate for /3. This step can be done 



component- wise in Gibbs fashion. See Devroye (1986), Geweke (1991), and 



Chopin (2011 ) for fast rejection-sampling schemes for one-dimensional truncated 



normals. 

The conditional posterior of the latent w/s can be determined as follows. Sup- 



pressing subscripts for the moment, it is clear from (13) that 



p{u I a) = a{(jje + (1 — a)e 
p{u\a,a) = Ca{a{ue-'^) + {l-a)e-'^}l{u>a) 



where a comes from inverting the slice region in (13) and Ca is an appropriate nor- 
malization constant. 

We can simulate from this mixture of truncated gammas by defining u = u — a, 
where uj > 0. Then uj has density 

p{u\a, a) = Ca {ae'^ia + w)e"'^ + (1 - a)e~"e~'^} 



1 — aa i — aa 

This is a mixture of gammas, where 

r(l, 1) with prob '~"^^+"^ 



(a; I a) ~ 



r(2, 1) with prob 



l—aa 
a 

l—aa 



After sampling u, simply transform back using the fact that u = a + u. 

This representation has two interesting and intuitive features. First, full condi- 
tional for f3 in step 3 is centered at the usual least-squares estimate /3. Only the 
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Algorithm 1: Gibbs sampling for Bayesian bridge regression 
For variable j — 1, . . . ,p 

1. Sample {uj \ f3j,u!j) ~ Unif ^0, 1 — T\f3j\u!j ^^"^ . 

2. Set 



1-Uj 



3. Sample Uj in two steps: 
(a) Sample 



Ga(l,l) withprob 

Ga(2, 1) withprob . 



(b) Set uuj — uuj + Qj. 

4. Generate f3 from a truncated multivariate normal proportional 
to 

N (^P,a^{X'X)-^^l{\(3j\ < bj for all j) , 
where ^ indicates the least-squares estimate for f3. 
Update hypcrparameters (r, a, a) as required. 

Figure 2: A Gibbs-samphng algorithm for the Bayesian bridge model. 
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truncations [bj) change at each step, which speeds matrix operations. Compare this 
to the usual scale-mixture representation, which involves inverting a matrix of the 
form {X'X + A)-^ at every MCMC step. 

Second, the mixture-of-gammas form of p{u)) naturally accounts for the bimodality 
in the marginal posterior distribution, p{f3j | y) = / p{f3j \ ijj,y)p{ujj \ y)dujj. Each 
mixture component of the conditional for Uj represents a distinct mode of the marginal 
posterior for 13 j. As the examples later will show, this endows the algorithm with the 
ability to explore various modes of the joint posterior very easily. 

We have summarized the algorithm in Figure |2} 



4.2 Sampling hyperparameters 

To update the global scale parameter r, we work directly with the exponential-power 
density, marginalizing out the latent variables {uj,Uj}. From (|2]), observe that the 
posterior for u = r"°, given /3, is conditionally independent of y, and takes the form 

p 

p{u I f3) oc i^*'/"exp(-i/ p{u) . 



Therefore if u has a Gamma(c, d) prior, its conditional posterior will also be a gamma 
distribution, with hyperparameters c* = c + p/a and d* = d + If^jl""- To sample 
r, simply draw u from this gamma distribution, and use the transformation r = z/^^/°. 
Alternative priors for u can also be considered, in which case the gamma form of the 
conditional likelihood in u will make for a useful proposal distribution that closely 
approximates the posterior. 

In many cases the concavity parameter a will be fixed ahead of time to reflect a 
particular desired shape of the penalty function. But in principle it too can be give 
a prior p{a). From (12), the conditional posterior in a would then be 



p{a\n,T,i3)^l[\l 



TOO 



1/a 



p{ujj I a) p{a) , 



which can be updated using, for example, a random-walk Metropolis sampler. 



5 Computing MAP estimates 
5.1 In the usual regression problem 

We now show — first for the usual regression problem, and then for logistic and quantile 
regression — that the mode of the marginal posterior distribution for (3 under the 
Bayesian bridge can be found using a simple EM algorithm, using the tilted, iteratively 
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reweighted least-squares (TIRLS) algorithm of Poison and Scott (2011c). 

We represent the conditional posterior distribution for f3 in two different ways in 
order to derive the E and the M steps. First, begin with joint posterior distribution 
for /3 and A in the normal scale-mixture representation of (|8]). Taking the natural log 
and factorizing as a function /3, we have 

^ n ^ k 

logp(/3 I A, r, y) = c,{A, Y, r) - - ^ {y, - x',/?)' " ^ E ^^^J ' 

i=i ^ j=i 

where Ci does not depend upon (3. 

We treat the above expression as the complete-data log posterior, with A as miss- 
ing data. The E step involves taking the conditional expectation of this expression, 
given some current guess f3'^^\ Since the complete-data log posterior depends linearly 
upon each Aj, 

n ^ k 

E{logp(/3 I A, r, y) | (3^^^) = -- (y. - " ^ E ' (14) 

where Xf = E(Aj | (i'f^ , r) is the conditional moment of Aj. 

To derive these conditional moments, recall from ([T]) that the exponential power 
density can be written normal scale mixture: 



K)= / 0(/3, |0,r2ATi)p(A,.)dA 



■3 ' 



where p{\j) is a power-tilted stable distribution. 
Since (/> is a normal kernel, 
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Differentiating under the integral sign, dividing through by p(/3j|r), and using the 
above identity for the inner function, we obtain the following expression: 

Ap(^,|r)^-|E(A,|^f,.). 

Therefore, the E step is simply to plug 

^ d{-logp{pf\r) 
a/3, 
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Algorithm 2: EM for Bayesian bridge 


linear regression 


Given a starting value f3^^^ and a threshold Xmax 


(e.g. 109): 


For iteration g = 1,2, . . . 




E Step: Set 

For all Xj > Xmax: delete and f3j from the model, along with the 
jth column of X. 

M Step: Solve A^9)f3{9+^) ^ ^ig)^ ^here 


h^9) X'y 


-X'X 


for A(5) = diag(AS^\...,A?^). 




End when the sequence of estimates . 


. .} has converged. 



Figure 3: An EM algorithm for MAP estimation in the Bayesian bridge model. 
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into (14). Notice that it is unnecessary to know the exphcit form of the mixing 
measure. Only the derivative of the negative log-prior for Pj (that is, the exponential 
power kernel) is required. 

To derive the M step, we return to ([s]). Collecting terms, we can represent the log 
posterior (up to an additive term not involving (3) as a sum of quadratic forms: 

logp(/3 I A, r, y) = (y - X(3)' {y - X(3) - ^f3'Af3 • 



By using the expressions given in, for example, Lindley and Smith (1972), we arrive 
at the fact that the M step involves choosing (3^^^^^ as the solution to the following 
linear system: 

(X'X + r-2A(f))^^'+'^=X'y. 

We summarize the resulting EM algorithm in Figure [3] One caveat here is that for 
~ 0, the conditional moment in the E step diverges to infinity, and care must be 
taken. It is important to emphasize here that infinite values for Aj do not reflect poor 
numerical behavior, but rather are the result of a normally functioning algorithm that 
has found a sparse solution. The numerical difficulty this poses can be easily handled 
by starting the algorithm from a value where /3 has no zeros, and then removing /3j 
from the model when it gets within a small numerical threshold of its prior mean (or 
equivalently, when A-,- > \max, e.g. 10^. This conveys the added benefit of hastening 
the matrix computations in the weighted least-squares (M) step. Although we have 
found this approach to work well in practice, it has the disadvantage that a variable 
cannot re-enter the model once it has been deleted. An alternate approach that avoids 
this difficulty involves the use of restricted least-squares; for details, see Section 3.2 



of Poison and Scott (2011d). 



5.2 In logistic and quant ile regression 

A similar EM algorithm can be used to find the MAP estimate for bridge-penalized 
logistic regression (LR) and quantile regression (QR): 

{n p 
^log(l + exp{-y,a;'i/3}) + |/3j/r|" 

{n p 
^{IVi - x[(3\ + {2q - l){yi - x./3)} + ^ |/3j/r| 
i=i j=i 

where binary outcomes are encoded as ±1, and where q G (0, 1) is the desired quantile. 
To compute these two estimators, we make use of the following identities concern- 
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Algorithm 3: EM for bridge quantile and logistic regression 

Given a starting value P^^^ and a threshold Xmax (e.g. 10^): 
For iteration g — 1,2,... 



E Step: Set 



if - 



y.xfp(^> \ l+exp[j/ia;f /3(f)] 2 / l^^BH^; 

l^/i -xf/3^^)|-^ (quantile) 



For all Xj > X^ax, delete Xj and (3j from the model, along with the 
jth column of X. 

M Step: Solve A^s)f3i9+^) ^ 1,(9)^ ^here 



1,(9) 



T-2A(3) + X^r(s)X^ (logit) 

r-2A(s) + xr^9)x (quantile) 
0.5X^^1 (logit) 

X' (r(s)y - [1 - 2g]l) (quantile) 



where A^^) = diag(AS^\ . . . , A^^^) and F^s) = diag(7i^\ . . . ,7^^^); 
where 1 is a column vector of ones; and where has rows y^Xj, 
recalling that binary responses are encoded as |/i = ±1. 



End when tlie so(iuence of estiinatos {j3^'^'\ /3^'^\ . . .} has converged. 
Figure 4: MAP estimation in bridge quantile and logistic regression. 
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ing the improper limits of variance-mean Gaussian mixtures: 



/■oo 

exp |— 2c~^ max(a6', 0)} = / (l){9 \ —av, cv) dv (15) 

Jo 

POO 

{1 + exp{e - fx})-^ = (PiOl fx-v/2,v) ppoiiv\0,l) dv, (16) 



where Pq{9) = + {q ~ ^) 9 is the check-loss function (Johannes et al. 



2009 



Li 



et al. , 2010 ); and Ppoi{v \ 0, 1) is the improper limit of the density function of a Polya- 
distributed random variable. These identities, and the corresponding distributional 



theory, are described in detail by Poison and Scott (2011c). 



These identies allow us to represent both the logistic and quantile-regression ob- 
jective functions as a conditionally Gaussian model, by introducing a second set of 
augmentation variables F = diag(7i, . . . , 7^) corresponding to the n individual terms 
in the likelihood. Details of the argument are presented in Appendix [B| with the 
algorithm itself summarized in Figure |4j 

5.3 A conjugate-gradient variation 

The bottleneck of Algorithms 2 and 3 lies in solving the linear system A^^^f3^^~^^^ = b^^^ 
for f3^^~^^\ where A and b change at every step. Solving this system exactly, however, 
is inefficient; an approximate solution will suffice in the M step, as long as it leads 
to an improvement in the observed-data objective function compared to the previous 
iteration. This will ensure that the sequence {Q{P^^\Q{f3^'^\ . . .} is monotonically 
decreasing. 

Such an approximate solution can be found using the conjugate-gradient algo- 



rithm; see, for example Golub and Van Loan (1996). We outline this modification 



in Figure |5j The conjugate- gradient algorithm's computational bottleneck involves 
a series matrix- vector multiplications, each of which is 0{p^). This offers two major 
advantages. 

1. If the algorithm proceeds exactly p steps, then an exact solution (subject to 
floating-point error) will be produced in 0{p^) total operations. But typically 
far fewer than p steps are necessary to reach a good approximate solution that 
improves the observed-data objective function. 

2. Many open-source libraries exist that implement parallel matrix- vector multipli- 
cation. This fact allows users to implement the algorithm quickly and painlessly 
in a multi-core processing environment, and can lead to a speed-up that is very 
nearly linear in the number of cores available. This compares favorably with 
coordinate-descent algorithms, which cannot be parallelized even in principle. 
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Algorithm 4: Bridge regression via conjugate gradient 


(linear, logit, and quantile) 


For iteration g = 1,2, . . . 




E Step: For linear regression 


with Gaussian error, set 7i = 1 and up- 


date A*^^) as in Algorithm 2. For logistic and quantile regression, 


update r^^') and A^^). 




For all Xj > Xmax, delete Xj and /3j from the model. 


M Step: First set A^^^ and b^^^ as in Algorithm 2 (linear regression) 


or Algorithm 3 (logistic or quantile regression). Then initialize 


the following inner-loop terms: 




:= /3(^-i) 




:= 6 - A(^)/3(»'°) 


d{g,o) 


— n9) 


C(fl,0) 


:= A(^)ci(,,o) 


"(s,o) 


'^[g, of (9,0) 
^[g,0f{9,0) 


^(5,0) 




While A(g /) > ^inin, increment I and set 






n9,i) — 


rig,l-l) - a^g,l-l)C^g,l-l) 


7(9,0 ■ = 


'-'{9,1/(9,1) 


""'{9,1-1/ (9,1-1) 


d{g,l) ■ = 


r(g,l) + J{g,l)d(g,l-1) 


(^{g,l) '■= 


A^'K,i) 


a(g,i) 


"^{9,0/(9,1) 

^[9,1)^(9,1) 


^(5,0 ■= 


0^{9,l)d{g,l) ■ 


Set (3^9) ^ i3{9,i) _ 




End when the sequence of estimates {j3^^\l3^'^\ . . .} has converged. 



Figure 5: A conjugate gradient algorithm for bridge regression. 
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6 Examples 



6.1 Diabetes data 

We first explore the Bayesian bridge estimator using the well-known data set on 



diabetes among Pima Indians, available in the R package lars (see, e.g. Efron et al. 



2004). The main data set has 10 predictors and 442 observations. Yet even for 
this relatively information-rich problem, significant differences emerge between the 
Bayesian and classical methods. 

We fit both the classical bridge (using Algorithm 2 and generalized cross valida- 
tion) and the Bayesian bridge (using Algorithm 1 and a default Gamma(2,2) prior for 
i>). Both the predictor and responses were centered, while the predictors were also 
re-scaled to have unit variance. At each step of the MCMC for the Bayesian model, 
we calculated the conditional posterior density for each /3j at a discrete grid of values. 

Figure |6] summarizes the results of the two fits, showing both the marginal poste- 
rior density and the classical bridge solution for each of the 10 regression coefficients. 
One notable feature of the problem is the pronounced multimodality in the joint pos- 
terior distribution for the Bayesian bridge. Observe, for example, the two distinct 
modes in the marginal posteriors for the coefficients associated with the TCH and 
Glucose predictors (and, to a lesser extent, for the HDL and Female predictors). In 
none of these cases does it seem satisfactory to summarize information about Pj using 
only a single number, as the classical solution forces one to do. 

Second, observe that the classical bridge solution does not coincide with the joint 
mode of the fully Bayesian posterior distribution. This discrepancy can be attributed 
to uncertainty in r and a, which is ignored in the classical solution. Marginalizing 
over these hyperparameters leads to a fundamentally different objective function, and 
therefore a different joint posterior mode. 

The difference between the classical mode and the Bayesian mode, moreover, need 
not be small. Observe, for example, the middle row in Figure |6| which shows the 
posterior distributions for the TC and LDL coefficients. These two predictors have a 
sample correlation of —0.897. The Bayesian solution concentrates in a region of MP 
where neither of these coefficients exerts much of an effect. The classical solution, on 
the other hand, says that both predictors should be in the model with large coefficients 
of opposite sign. 

It is impossible to say in any objective sense whether TC and HDL are both 
necessary, or instead are redundant copies of the same unhelpful information. It is 
highly surprising, however, that such a marked difference would arise between the 
full Bayes mode and the classical mode, and that this difference would fundamentally 
alter one's conclusions about two predictors out of ten. (The full Bayes posterior 
mean is, of course, different yet again.) Clearly an very important role here is played 
by the decision of whether to account for uncertainty in r and a. 
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Age 



Female 







LTG 



Glucose 




Figure 6: Marginal posterior densities for the marginal effects of 10 predictors in 
the diabetes data. Solid red line: penalized-likelihood solution with v chosen by 
generalized cross validation. Dashed blue line: marginal posterior mean for Dotted 
black line: mode of the marginal distribution for /3j under the fully Bayes posterior. 
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Table 1: Average sum of squared errors in predicting hold-out observations for 100 
different train/test splits on three real data sets. 

Prediction SSE 

Data set n p train/test LSE Bridge Bayes 

Boston housing 506 103 422/84 l288 THT 455~ 
Ozone 203 54 163/40 872 659 415 

NIR glucose 40 166 110/56 2791 2980 2375 

6.2 Out-of-sample prediction results 

Next, we describe the results from three out-of-sample prediction exercises involving 
the following benchmark data sets. 

Boston housing data: available in the R package mlbench. The goal is to predict 
the median house price for 506 census tracts of Boston from the 1970 census. As 
covariates, we used the 14 original predictors, plus all interactions and squared 
terms for quantitative predictors. 

Ozone data: available in the R package mlbench. The goal is to predict the con- 
centration of ozone in the atmosphere above Los Angeles using various envi- 
ronmental covariates. As covariates, we used the 9 original predictors, plus all 
interactions and squared terms for quantitative predictors. 

NIR Glucose data: available in the R package chemometrics. The goal is to pre- 
dict the concentration of glucose in molecules using data from NIR spectroscopy. 

For each data set, we created 100 different train/test splits, using the results from 
the training data to forecast the test data. For each train/test split we estimated f3 
using least-squares, the classical bridge (using Algorithm 2), and the Bayesian-bridge 
posterior mean (using Algorithm 1). In all cases we chose a = 0.5; centered and stan- 
dardized the predictors; and centered the response. For the classical bridge estimator, 
the regularization parameter u was chosen by generalized cross validation; while for 
the Bayesian bridge, a was assigned Jeffreys' prior and z/ a default Gamma(2,2) prior. 

We measured performance of each method by computing the sum of squared errors 
in predictor y on the test data set. Details of each data set, along with both the results 
and the train/test sample sizes used, are in Table [l] In all three cases, the posterior 
mean estimator outperforms both least squares and the classical bridge estimator. 

6.3 Simulated data with correlated design 

We conducted three experiments, all with p = 100 and n = 101, for a G {0.9, 0.7, 0.5}. 
Each experiment involved 250 data sets constructed by: (1) simulating regression 
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Table 2: Average sum of squared errors in estimating f3 for three different batches 
of 250 simulated data sets. 







LSE 


Bridge 


Bayes 


a 


= 0.5 


2254 


1611 


99 


a 


= 0.7 


1994 


406 


225 


a 


= 0.9 


551 


144 


85 



coefficients from the exponential power distribution for the given choice of a; (2) 
simulating correlated design matrices X; and (3) simulating residuals from a Gaussian 
distribution. In all cases we set a = r = 1. The rows of each design matrix were 
simulated from a Gaussian factor model, with covariance matrix V = BB' + I for 
a 100 X 10 factor loadings matrix B with independent standard normal entries. As 
is typical for Gaussian factor models with many fewer factors (10) than ambient 
dimensions (100), this choice led to marked multi-collinearity among the columns of 
each simulated X. 

For each simulated data set we estimated (3 with least squares, the classical bridge 
(using Algorithm 2), and the Bayesian bridge posterior mean (using Algorithm 1). 
Performance was assessed by the sum of squared errors in estimating the true value 
of (3. Convergence of both algorithms was assessed by starting from multiple distinct 
points in W and checking that the final solutions were identical up to machine and/or 
Monte Carlo precision. As before, for the classical bridge estimator, the regularization 
parameter u was chosen by generalized cross validation; while for the Bayesian bridge, 
cr was assigned Jeffreys' prior and v a Gamma(2,2) prior. 

Table |2] shows the results of these experiments. For all three choices of a, the 
posterior mean estimator outperforms both least squares and the classical bridge 
estimator. Sometimes the difference is drastic — such as when a = 0.5, where the 
Bayes estimator outperforms the classical estimator by more than a factor of 16. 



7 Generalizations 



Our method based on mixtures of Bartlett-Fejer kernels also leads to interesting 
extensions that encompass other known models. For example, the generalized Linnik 
Lab distribution 



see 



Devroye , 1996 ) has density defined by 



p{(3 I a,b) 



1 



(1 + 1/31 



a\b 



The Linnik distribution corresponds to 6 = 1. This class of priors nests many special 



cases including the double Pareto (Armagan et al. , 2010), meridian, and Cauchy. One 
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may simulate from the prior using the identity /3 = X'^Sa,o for c = b/a, where Safl is 
a symmetric stable and X is a random variable with density /(x) = e~'^' /r(l + b^^). 
This can also be generalized to the Mittag-Leffler distribution, where in lieu of Sa,o 
one instead uses Sa,i = CSa,o, with C an extra Cauchy-distributed scale factor. 
To see the connection with Bartlett-Fejer kernels, consider the Linnik distribu- 



tion with kernel p{t) = (1 + |t|) ^. Applying Theorem 3.1, the CDF of the mixing 
distribution is 



(6- l)t- 1 

For the general case. 



F{t) = l-p{t)+tp'{t) = l , 



Pit) 



a\b 



the conditions of the theorem are met, and with further algebra the mixing measure 
can be identified as 

If we make the transformation u = we have the density 



We simulate from this distribution in one line of code directly from the CDF. 
Indeed, we even know the CDF of the truncated version of this distribution, which is 
required in the slice sampler: 

8 Discussion 

This paper has demonstrated a series of results that allow practitioners to estimate 
both the posterior mode and the full joint distribution under the Bayesian bridge 
model. Our numerical experiments have shown: (1) that the classical mode, the full 
Bayes mode, and the full Bayes mean can often lead to very different summaries 
about the relative importance of different predictors; and (2) that using the posterior 
mean offers substantial improvements over the mode when estimating f3 or making 
predictions under squared-error loss. Both results parallel the findings of [Park and 



Casella (2008) and Hans (2009) for the Bayesian lasso. We have also shown how the 
bridge penalty can be applied both to logistic and quantile regression via a simple EM 
algorithm. The conjugate-gradient version (Algorithm 4) is also easily parallelized, 
lending further scalability to the method. 
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The existence of a second, novel mixture representation for the Bayesian bridge is 
of particular interest, given the difficulties of the normal scale-mixture approach for 
this model. It leads to an efficient Gibbs-sampling scheme that — by virtue of working 
directly with a bimodal mixing measure for each latent scale Uj — is capable of easily 
jumping between modes in the joint posterior distribution. It thereby avoids many 
of the difficulties associated with slow mixing in global-local scale-mixture models 
described by, for example, Hans (2009) and Scott (2010). 

A full comparison of our approach with the traditional scale-mixture MCMC would 
require a fast, robust method for sampling power-tilted or exponentially tilted alpha- 
stable random variables across a wide variety of input parameters. Constructing 
such a method, and benchmarking it against the Bartlett-kernel mixture approach, 
remains an active question for future research. 



A Proofs 



A.l Theorem [37T] 

Proof. Any function f(t) meeting the stated conditions is a characteristic function 



of Polya type. Applying the theorem of Dugue and Girault (1955), we can represent 



f{t) as the characteristic function of a known mixture of Fejer-de le Vallee Poussin 
(FVP) random variables. Specifically, we have the following integral identity for f{t): 



fit) 



dG{s) , 



where G{s) = l — f{s)+sf'{s) (see also Devroye, 1984). Thus if / is twice differentiable 
on (0,oo), the unnormalized mixing density is G'{s) = sf"{s) Thus 



kf{t) 



ksf"{s) ds 
ks^fis) ds, 

leaving a mixture with respect to a properly normalized Bartlett-Fejer kernel. 



1 



□ 
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A. 2 Corollary [372 



Proof. Transform s to = and apply Theorem to the kernel of the exponential- 
power density function. This yields 



1 
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exp(-|/3/rn) 



1 



/3 



I a) = auje~'^ + (1 ~ a)e^' 



I a) dco 



Simple algebra with the normalizing constants yields a properly normalized mixture 
of Bartlett-Fejer kernels: 



a 



2rr(l + 1/a) 



exp(-|/3/rn) 
p{u I a) 



1 



1 + a 







1 



/3 



1/q 



I a) duj 



□ 



B EM for logistic and quantile regression 
B.l M Step 

Both the logistic and quantile-regression objective functions can be expressed as 
pseudo posteriors of the form 

{n ^1 
-5^/(l/„x:/3)-5^|/3,/rr (17) 
i=l j=l J 

= p(z|/3)-p(/3|r), 

where Zi = yi — x[f3 for quantile regression, or Zi = yix[f3 for logistic regression (with 
the response yi coded as ±1). Moreover, both the pseudo-likelihood p{zi \ f3) and the 
prior p{Pj I r) can be represented as variance-mean mixtures of normals, using the 
identies in (jTsj) and (16): 



p{z, \/3)= (j){z, I /i, + K,^-\ a V') ^^(7.) (18) 

^0 

POO 

pWj \r)= m I 0, r'Xj') dP{\,) . (19) 
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For logistic regression, /x^ = and = 1/2; while for quantile regression /i^ = 
and tiz = 1 — 2q. (Other likelihoods, such as that corresponding to support-vector 
machines, can be encoded through different choices of these two constants.) Using 
this representation, write the complete-data log posterior distribution as 

1 " 2 

logp(/3 I 7, A,r,y) = Co(7, A,?/,r) - - ^ 7^ (2;, - /i^ - Ky-f^^) 

1=1 

-^E^^^^^-'^/^-'^A"')' (20) 
i=i 

for some constant Cq. 

First, we derive the M step for the quantile regression likelihood. Plugging in 
Zi = Hi — x'j/3 and collecting terms, we can write logp(/3 | 7, A, r, y) as 

{{y - fiA - Kz^-'} - Xf3)^T {{y - - k,^-'} - Xf3) - ^ {f3'Af3) . 

This is the log posterior under a normal prior f3 ~ N(0,r^A~^) and heteroskedastic 
Gaussian likelihood. Simple algebra then leads to the result. 

For logistic regression, let be the matrix with rows x* = yiXi. The kernel of 
the conditionally normal likelihood then becomes 

(X,/3 - - K^r^y V (X,/3 - - /€,7-i) . 

Hence it is as if we observe the n-dimensional "data" vector ^z'^ + i'^zl~^ in a regression 
model having design matrix X^, and we proceed by a similar argument to arrive at 
the result. 

B.2 E Step 



Factorize (20) function of f3 to yield 



^ n n 

logp(/3 I 7, A,r,?/) = ci(7, A,?/,r) - -^liizi- n^f + Ky^{zi - ji-,) 

i=l i=l 



i=i i=i 
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which depends hnearly upon 7^ and Xj. Therefore for the E step, we plug the following 
conditional moments into the complete-data objective function: 



A 



(9) 
j 

(9) 



ar 



I / e^i —11 



\y^ 



for logistic regression 
for quantile regression. 



The conditional moments for 7^ can be derived using an argument that is essentially 



identical to that following Equation ( 14 ) — that is, by differentiating the normal kernel 



under the integral of the mixture representation for the likelihood (c.f. Poison and 



Scott, 2011c). 
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